Rapid Distance-Based Outlier Detection via Sampling

Authors

  • Mahito Sugiyama
  • Karsten M. Borgwardt
Abstract

Distance-based approaches to outlier detection are popular in data mining because they do not require modeling the underlying probability distribution, which is particularly challenging for high-dimensional data. We present an empirical comparison of various approaches to distance-based outlier detection across a large number of datasets. We report the surprising observation that a simple, sampling-based scheme outperforms state-of-the-art techniques in terms of both efficiency and effectiveness. To better understand this phenomenon, we provide a theoretical analysis of why the sampling-based approach outperforms alternative methods based on k-nearest neighbor search.
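
The abstract does not spell out the sampling-based scheme. As a minimal sketch, assume it scores each point by its distance to the nearest member of a small random sample drawn once from the data, so that only n × s distances are computed instead of a full k-nearest neighbor search. The function name `sampling_outlier_scores`, the default `sample_size=20`, the Euclidean metric, and the masking of self-distances are illustrative assumptions, not the authors' reference implementation.

```python
import numpy as np


def sampling_outlier_scores(X, sample_size=20, seed=0):
    """Score each row of X by its distance to the nearest sampled row.

    Hypothetical helper: the name, defaults, and self-distance masking
    are assumptions made for illustration only.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    s = min(sample_size, n)
    if s < 2:
        raise ValueError("need at least two points to form a sample")

    rng = np.random.default_rng(seed)
    # Draw the sample once, without replacement.
    idx = rng.choice(n, size=s, replace=False)
    sample = X[idx]

    # Euclidean distances from every point to every sampled point: shape (n, s).
    dists = np.linalg.norm(X[:, None, :] - sample[None, :, :], axis=2)

    # Assumption: a sampled point must not match itself, otherwise its
    # score would trivially be zero.
    dists[idx, np.arange(s)] = np.inf

    # Larger score = farther from the sample = more outlying.
    return dists.min(axis=1)
```

Points are then ranked by score, with the largest scores flagged as candidate outliers. The cost is linear in the number of points for a fixed sample size, which is consistent with the efficiency claim in the abstract.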

Similar resources

Nonparametric Depth-Based Multivariate Outlier Identifiers, and Masking Robustness Properties

In extending univariate outlier detection methods to higher dimension, various issues arise: limited visualization methods, inadequacy of marginal methods, lack of a natural order, limited parametric modeling, and, when using Mahalanobis distance, restriction to ellipsoidal contours. To address and overcome such limitations, we introduce nonparametric multivariate outlier identifiers based on m...

Detection of Outlier Patches in Autoregressive Time Series

This paper proposes a procedure to detect patches of outliers in an autoregressive process. The procedure is an improvement over the existing detection methods via Gibbs sampling. We show that the standard outlier detection via Gibbs sampling may be extremely inefficient in the presence of severe masking and swamping effects. The new procedure identifies the beginning and end of possible outlier pa...

Outlier Detection for Support Vector Machine using Minimum Covariance Determinant Estimator

The purpose of this paper is to identify the points that affect the performance of one of the important algorithms of data mining, namely the support vector machine. The final classification decision is made based on a small portion of the data called support vectors. So, the existence of atypical observations among the aforementioned points will result in deviation from the correct decision. Thus...

Investigating Outliers Detection Methods for the Iranian Manufacturing Establishment Survey Data

The role and importance of the industrial sector in economic development make accurate and timely data necessary for exact planning. As outliers are common in establishment survey data due to the structure of the economy, it is necessary to evaluate survey data by identifying and investigating outliers prior to the release of the data. In this paper the practical applicatio...

Outlier Detection Using Distributed Mining Technology In Large Database

In many data analysis tasks, a large number of variables are recorded or sampled. One of the first steps towards obtaining a coherent analysis is the detection of outlying observations. A distributed approach is presented to detect distance-based outliers, based on the concept of an outlier detection solving set. Data objects, which are different from or inconsistent with the remaining set ...

Publication date: 2013